1 Introduction

Energy consumption is a critical concern worldwide due to its impact on the environment, economy, and human welfare. Therefore, understanding the factors that influence energy consumption in buildings is essential to optimize energy use and minimize its negative effects. Multiple linear regression is a statistical method used to model the relationship between a dependent variable and several independent variables simultaneously. In this report, we perform a multiple linear regression analysis to investigate the factors that affect energy consumption. The analysis is based on a dataset that includes information on natural gas consumption and several variables related to weather conditions (such as the mean external temperature and the irradiance). The objective of this study is to identify the significant predictors of energy consumption and provide insights into the underlying mechanisms that drive energy use.

The report is organized in the following sections:

  • Dataset:
  • Outlier detection: …
  • Multiple linear regression model: …
  • Conclusion: …

2 Dataset

The dataset used in this analysis consists of 3 numerical variables, total daily gas consumption Energy \([Smc]\), mean daily external temperature Text \([°C]\), and mean solar irradiance Iext \([W/m^2]\), and 1 categorical variable, the day of the week DayOfTheWeek.

The dataset provides daily measurements of these variables for a full heating season in Turin, which runs from \(1^{st}\) November to \(31^{st}\) March, resulting in a total of 151 records.
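The record count can be checked with a short sketch. It assumes one record per calendar day and takes 2017-11-01 and 2018-03-31 as the season limits, which is the date range reported in the dataset summary. Python is used here only for illustration, and the names `season_start`, `season_end`, and `days` are hypothetical.

```python
from datetime import date, timedelta

# Assumed season limits, taken from the date range in the dataset summary.
season_start = date(2017, 11, 1)
season_end = date(2018, 3, 31)

# One entry per calendar day, both endpoints included.
n_days = (season_end - season_start).days + 1
days = [season_start + timedelta(days=k) for k in range(n_days)]

print(len(days))  # 151
```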

The table below shows a sample of the dataset.

The trend of the variables during the heating season is represented in the figure below.

For the following steps, it is useful to summarize the dataset in terms of summary statistics and distributions:

##       date             DayOfTheWeek       Text             Iext       
##  Min.   :2017-11-01   Min.   :1.00   Min.   :-5.950   Min.   :  0.50  
##  1st Qu.:2017-12-08   1st Qu.:2.00   1st Qu.:-0.115   1st Qu.:  3.48  
##  Median :2018-01-15   Median :4.00   Median : 2.920   Median : 34.34  
##  Mean   :2018-01-15   Mean   :4.04   Mean   : 3.103   Mean   : 41.23  
##  3rd Qu.:2018-02-21   3rd Qu.:6.00   3rd Qu.: 6.605   3rd Qu.: 71.47  
##  Max.   :2018-03-31   Max.   :7.00   Max.   :11.610   Max.   :182.10  
##      Energy        day_name        
##  Min.   :  0.0   Length:151        
##  1st Qu.:257.1   Class :character  
##  Median :389.2   Mode  :character  
##  Mean   :382.2                     
##  3rd Qu.:556.2                     
##  Max.   :676.8

3 Outlier detection

In this section, an outlier detection procedure is applied to identify values of the analyzed variables that lie far enough from the bulk of the data to lead to incorrect or misleading conclusions when developing a multiple regression model.

One suitable tool in this case is Cook’s distance, which measures the influence of each observation on a regression fit. It can be used to identify multivariate outliers in non-normal distributions, like ours, by examining its value for each observation. Large values of Cook’s distance indicate observations that have a disproportionate influence on the fitted model, possibly because they are outliers.

Cook’s distance is evaluated as:

\[D_i = \frac{\sum_{j=1}^n (\hat{y}_j - \hat{y}_{j(i)})^2}{p s^2}\] where \(\hat{y}_j\) is the fitted value for observation \(j\) from the model estimated on all observations, \(\hat{y}_{j(i)}\) is the fitted value for observation \(j\) from the model estimated without observation \(i\), \(s^2\) is the mean squared error of the full model, and \(p\) is the number of estimated regression coefficients (intercept included).
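The definition can be made concrete with a minimal leave-one-out sketch in plain Python. This is not the code used for the report, whose model is fitted with R's `lm`. The helper names (`solve`, `fit_ols`, `predict`, `cooks_distances`) and the toy data are assumptions made for illustration. The toy responses lie exactly on a plane, except for one deliberately corrupted record at index 7.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_ols(X, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    k = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k)]
    return solve(xtx, xty)


def predict(X, beta):
    """Fitted values for every row of the design matrix X."""
    return [sum(b * v for b, v in zip(beta, row)) for row in X]


def cooks_distances(X, y):
    """Cook's distance for each observation, by refitting without it."""
    n, p = len(X), len(X[0])  # p counts the intercept column as well
    fitted = predict(X, fit_ols(X, y))
    s2 = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted)) / (n - p)
    distances = []
    for i in range(n):
        beta_i = fit_ols(X[:i] + X[i + 1:], y[:i] + y[i + 1:])
        fitted_i = predict(X, beta_i)  # predictions for all n rows
        shift = sum((f - g) ** 2 for f, g in zip(fitted, fitted_i))
        distances.append(shift / (p * s2))
    return distances


# Hypothetical toy data: columns are intercept, temperature, irradiance.
temps = [-5.0, 0.0, 5.0]
irrs = [0.0, 40.0, 80.0, 120.0]
X = [[1.0, t, r] for t in temps for r in irrs]
y = [570.0 - 33.0 * t - 0.5 * r for _, t, r in X]
y[7] -= 200.0  # corrupt one record so that it becomes an outlier

distances = cooks_distances(X, y)
most_influential = max(range(len(distances)), key=lambda i: distances[i])
print(most_influential, round(distances[most_influential], 3))
```

Refitting the model once per observation is the most direct reading of the formula. For larger datasets, the equivalent closed form based on residuals and leverages avoids the repeated fits.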

To better visualize the dataset, the figure below shows a 3D scatter plot, with points colored by \(DayOfTheWeek\).

As can easily be seen, there is no energy consumption on Sundays, so these records can be removed before fitting the model to improve its accuracy.

We can now fit the model and evaluate Cook’s distance:

A common rule of thumb for outlier detection with Cook’s distance is to use a threshold of \(4/n\), where \(n\) is the number of observations (130). Records with a Cook’s distance higher than \(4/130 \approx 0.031\) are therefore considered outliers and removed from the model to make it more accurate.
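The values of \(n\) and of the threshold can be reproduced with a small sketch. It assumes one record per calendar day between 2017-11-01 and 2018-03-31, with only the Sundays removed. The variable names are hypothetical, and Python is used purely for illustration.

```python
from datetime import date, timedelta

# Assumed season limits (one record per day is an assumption).
season_start = date(2017, 11, 1)
season_end = date(2018, 3, 31)
n_days = (season_end - season_start).days + 1
days = [season_start + timedelta(days=k) for k in range(n_days)]

# date.weekday() returns 6 for Sunday.
non_sundays = [d for d in days if d.weekday() != 6]

n = len(non_sundays)
threshold = 4 / n  # rule-of-thumb cutoff for Cook's distance

print(n, round(threshold, 3))  # 130 0.031
```

Under these assumptions, 21 of the 151 days are Sundays, which leaves the 130 observations quoted above.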

Let’s plot the Cook’s distances and the threshold identified:

As we can see, this metric identifies 4 outliers: 2017-11-13, 2018-01-11, 2018-02-20, and 2018-02-23.

We can now remove these records and refit the linear regression model, evaluating its performance.

4 Multiple regression model

Once the data are cleaned, a linear regression model can be fitted using the external temperature \(T_{ext}\) and the solar irradiance \(I_{ext}\) as predictors.

## 
## Call:
## lm(formula = Energy ~ Text + Iext, data = data_final)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -88.589 -35.919  -9.473  38.182  92.225 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 567.6156     6.4319  88.250  < 2e-16 ***
## Text        -33.2272     0.9738 -34.123  < 2e-16 ***
## Iext         -0.5605     0.1072  -5.228 7.12e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 44.35 on 123 degrees of freedom
## Multiple R-squared:  0.909,  Adjusted R-squared:  0.9076 
## F-statistic: 614.6 on 2 and 123 DF,  p-value: < 2.2e-16
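The coefficient table above can be read as a prediction equation. The sketch below copies the three estimates from the output and evaluates them for a hypothetical day; the function name `predict_energy` and the input values (5 °C, 50 W/m^2) are illustrative assumptions, not records from the dataset.

```python
# Estimates copied from the coefficient table of the fitted model.
INTERCEPT = 567.6156   # Smc, expected consumption at Text = 0 and Iext = 0
COEF_TEXT = -33.2272   # Smc per degree Celsius of mean external temperature
COEF_IEXT = -0.5605    # Smc per W/m^2 of mean solar irradiance


def predict_energy(text, iext):
    """Predicted daily gas consumption [Smc] from the fitted linear model."""
    return INTERCEPT + COEF_TEXT * text + COEF_IEXT * iext


# Hypothetical day: 5 degrees Celsius and 50 W/m^2.
example = predict_energy(5.0, 50.0)
print(round(example, 2))
```

Both slopes are negative: in this model, a warmer or sunnier day lowers the predicted gas consumption, by about 33 Smc per additional degree and about 0.56 Smc per additional W/m^2.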

As can easily be observed, the regression model fits the data well. In particular, we can observe the following features:

  • The \(R^2\) takes a high value, 0.9090416, very close to the adjusted \(R^2\), 0.9075626, which suggests that the predictors used in the model are not redundant and explain the variance in the data very well.
  • The same conclusion can be drawn from the p-values of the coefficients, all of which are highly statistically significant (below 0.001).
  • The F statistic, which is used to perform a model utility test based on an F distribution, is large (614.6 on 2 and 123 degrees of freedom), so the null hypothesis that all slope coefficients are zero is rejected: there is a useful linear relationship between \(Energy\) and the predictors.
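The adjusted \(R^2\) and the F statistic quoted above both follow from \(R^2\), the number of observations, and the number of predictors. The sketch below checks this with the standard formulas. It takes \(n = 126\), which is an assumption derived from the 123 residual degrees of freedom plus the 3 estimated coefficients; the variable names are hypothetical.

```python
r2 = 0.9090416   # multiple R-squared reported for the fitted model
n = 126          # assumed: 123 residual degrees of freedom + 3 coefficients
p = 2            # number of predictors (Text and Iext)

# Adjusted R-squared penalizes R-squared for the number of predictors.
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)

# F statistic of the model utility test (all slopes equal to zero).
f_stat = (r2 / p) / ((1 - r2) / (n - p - 1))

print(round(adj_r2, 7), round(f_stat, 1))
```

With only two predictors and 126 observations the penalty factor \((n-1)/(n-p-1)\) is close to 1, which is why the two \(R^2\) values are nearly equal.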

For completeness, we also show some diagnostic plots used to visualize the strength of the regression model.

5 Conclusion